Classification
Core Concept
Classification is a supervised learning task where the goal is to assign input data to one of a finite set of discrete categories or classes. Given labeled training examples, a classification model learns decision boundaries or probability distributions that partition the input space, enabling it to predict which category a new, unlabeled instance belongs to. Unlike regression, which predicts continuous numerical values, classification produces categorical outputs – discrete labels representing qualitatively different categories rather than points on a numerical scale.
The Learning Process
The classification process involves training a model on examples where both features (input variables) and class labels (categorical outputs) are known. The model learns patterns that distinguish between classes – for instance, differentiating spam from legitimate emails based on word frequencies, identifying handwritten digits from pixel values, or diagnosing diseases from medical test results. Learning algorithms adjust internal parameters to minimize classification error on training data, using optimization techniques like gradient descent or information-theoretic criteria like entropy reduction. During inference, the trained model receives new inputs and assigns them to the most probable class based on learned patterns.
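To make this train-then-predict loop concrete, here is a minimal sketch using scikit-learn's LogisticRegression; the feature values and labels are a tiny made-up dataset chosen purely for demonstration, not drawn from any real problem.

```python
# Minimal sketch of supervised classification: fit on labeled examples,
# then predict the class of a new, unlabeled instance.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Labeled training data: feature vectors X_train and known class labels y_train.
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
y_train = np.array([0, 0, 1, 1])  # two classes, labeled 0 and 1

# Fitting adjusts internal parameters to reduce classification error on the
# training data (logistic regression uses gradient-based optimization).
model = LogisticRegression()
model.fit(X_train, y_train)

# Inference: assign a new instance to the most probable class.
X_new = np.array([[5.5, 5.0]])
print(model.predict(X_new))        # hard label, e.g. array([1])
print(model.predict_proba(X_new))  # probability for each class
```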
Decision Boundaries
How classification models partition the feature space to separate different classes
- Linear classifiers – Logistic regression and linear SVMs create straight-line or hyperplane boundaries, suitable when classes are linearly separable in the feature space
- Non-linear classifiers – Decision trees, kernel SVMs, and neural networks can learn arbitrarily complex boundaries, capturing intricate patterns but risking overfitting to training data (both kinds of boundary are contrasted in the sketch after this list)
- Margin – The distance between the decision boundary and the nearest training examples affects generalization, with larger margins typically indicating more robust classification
- Overlapping classes – When classes intermingle in the feature space, no boundary can separate them perfectly, requiring probabilistic approaches that model uncertainty rather than hard separation
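The contrast between linear and non-linear boundaries can be seen by fitting both kinds of model to the same data. The sketch below uses scikit-learn's make_moons toy dataset, which is not linearly separable; the particular estimators and settings are illustrative assumptions, not the only reasonable choices.

```python
# Sketch: a linear boundary vs. a non-linear (RBF-kernel) boundary
# on a toy dataset where the classes are not linearly separable.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_train, y_train)   # straight-line (hyperplane) boundary
nonlinear = SVC(kernel="rbf").fit(X_train, y_train)   # curved boundary via the RBF kernel

print("linear test accuracy:    ", linear.score(X_test, y_test))
print("non-linear test accuracy:", nonlinear.score(X_test, y_test))
```

On data like this the kernel SVM typically scores noticeably higher, at the cost of a boundary that is harder to interpret and easier to overfit.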
Prediction Types
Different forms of output that classification models can produce
- Hard classification – Assigns each instance to a single class deterministically, providing a definitive label without uncertainty information
- Soft classification (probabilistic) – Provides confidence scores or probability distributions across all possible classes, offering richer information for downstream decision-making. For example, a medical diagnostic system might report "70% probability benign, 30% probability malignant" rather than a single verdict
- Threshold tuning – With probabilistic outputs, the decision boundary can be adjusted based on the relative costs of different error types without retraining the model
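As a sketch of soft classification and threshold tuning, the example below uses scikit-learn's predict_proba on a synthetic dataset (all parameters are illustrative) to show how the same fitted model yields different hard labels as the probability cutoff moves.

```python
# Sketch: soft (probabilistic) outputs, then hard labels via an adjustable threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]   # P(class 1) for each test instance

default_labels = (proba >= 0.5).astype(int)   # standard cutoff
cautious_labels = (proba >= 0.2).astype(int)  # lower cutoff when missing a positive is costly

print("positives flagged at threshold 0.5:", default_labels.sum())
print("positives flagged at threshold 0.2:", cautious_labels.sum())
```

No retraining happens between the two thresholds; only the decision rule applied to the probabilities changes.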
Evaluation Metrics
Methods for measuring and comparing classification model performance
- Accuracy – Percentage of correct predictions; intuitive but misleading for imbalanced datasets where a naive classifier predicting only the majority class achieves high accuracy while providing no value
- Precision – Fraction of positive predictions that were correct; answers "of all instances we predicted as positive, how many actually were?"
- Recall (Sensitivity) – Fraction of actual positives that were identified; answers "of all actual positive instances, how many did we find?"
- F1-score – Harmonic mean of precision and recall, providing a single metric that balances both concerns
- Confusion matrix – Visualizes performance across all class pairs, revealing which classes are frequently confused with each other
- ROC curves and AUC – ROC plots true positive rate against false positive rate across different classification thresholds; AUC (area under curve) provides a threshold-independent performance measure particularly useful for binary classification and imbalanced problems
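All of these metrics are available in scikit-learn's metrics module. The sketch below computes them from hand-written true labels, hard predictions, and probability scores; the numbers are invented solely to illustrate the function calls.

```python
# Sketch: computing common classification metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]   # actual labels
y_pred  = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]   # hard predictions
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.2, 0.7, 0.1]  # predicted P(class 1)

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))          # harmonic mean of the two
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # threshold-independent
```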
Common Challenges
Practical difficulties that commonly arise when building and deploying classifiers
- Class imbalance – Some classes vastly outnumber others, biasing models toward the frequent classes and requiring resampling techniques or cost-sensitive learning (see the sketch after this list)
- High-dimensional feature spaces – Many input variables relative to training examples risk overfitting and computational expense, necessitating dimensionality reduction or regularization
- Noisy or incorrect labels – Errors in training labels corrupt the learning process, especially problematic when labeling is subjective or error-prone
- Overlapping class distributions – Fundamental ambiguity where perfect classification is impossible, requiring probabilistic handling
- Concept drift – The relationship between features and labels changes over time, degrading deployed model performance
- Feature selection and engineering – Choosing or constructing relevant input variables often determines success more than algorithm choice
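Picking up the class-imbalance point, one common cost-sensitive option in scikit-learn is class_weight="balanced", which re-weights errors on the rare class instead of resampling the data. The sketch below uses a synthetic skewed dataset, so the exact numbers are illustrative only.

```python
# Sketch: cost-sensitive learning for class imbalance via class weighting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Roughly 95% negatives, 5% positives: a heavily skewed dataset.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_train, y_train)
weighted = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# Recall on the rare positive class usually improves once errors on it are up-weighted.
print("recall (unweighted):", recall_score(y_test, plain.predict(X_test)))
print("recall (balanced):  ", recall_score(y_test, weighted.predict(X_test)))
```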
Sub-types
Classification sub-types are distinguished by the number of classes, whether classes are mutually exclusive, the presence of ordering among classes, and the distribution balance of training examples across classes.
- Binary Classification – Classification with exactly two mutually exclusive classes (positive/negative, true/false, spam/not spam).
- Multi-Class Classification – Classification with three or more mutually exclusive categories where each instance belongs to exactly one class.
- Multi-Label Classification – Classification where instances can belong to multiple classes simultaneously (e.g., tagging a document with multiple topics); a minimal sketch follows this list.
- Ordinal Classification – Classification with ordered categories where the relative ranking between classes is meaningful (e.g., rating scales, disease severity levels).
- Imbalanced Classification – Classification tasks where the distribution of classes is heavily skewed, with some classes significantly underrepresented in the training data.
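As a sketch of the multi-label case, the example below encodes made-up topic tags as a binary indicator matrix and trains one binary classifier per tag with scikit-learn's OneVsRestClassifier; the documents, tags, and choice of base estimator are illustrative assumptions.

```python
# Sketch: multi-label classification, where each document may carry several tags at once.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["stock markets fell", "new vaccine trial", "vaccine stocks rally", "election results in"]
tags = [{"finance"}, {"health"}, {"finance", "health"}, {"politics"}]  # made-up labels

X = CountVectorizer().fit_transform(docs)      # bag-of-words features
Y = MultiLabelBinarizer().fit_transform(tags)  # one 0/1 column per possible tag

# One binary classifier per tag; an instance can be positive for several tags at once.
model = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(model.predict(X))   # rows of 0/1 flags, one column per tag
```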